Skip to content

feat: JudgeLLM evaluation with ProposalAmender#248

Open
rioloc wants to merge 1 commit into
lightspeed-core:mainfrom
rioloc:feat/judge-evaluation
Open

feat: JudgeLLM evaluation with ProposalAmender#248
rioloc wants to merge 1 commit into
lightspeed-core:mainfrom
rioloc:feat/judge-evaluation

Conversation

@rioloc
Copy link
Copy Markdown
Collaborator

@rioloc rioloc commented May 27, 2026

Depends on #232 — this PR must be merged after #232.

Summary

This PR adds LLM-as-judge evaluation and enriched data capture for the ProposalDriver evaluation pipeline, along with per-scenario test infrastructure and a new CrashLoopBackOff test fixture.

Core changes

  1. ProposalAmender — after a Proposal CR reaches terminal state, fetches child Result CRs (AnalysisResult, ExecutionResult, VerificationResult, EscalationResult) from the cluster and enriches TurnData with:

    • proposal_results: structured dict with the complete .status of each child CR
    • response: a Markdown workflow summary suitable for both human review and LLM-as-judge evaluation
  2. custom:proposal_evaluation_correctness — a new LLM-as-judge metric (score 0–1) with a multi-dimensional evaluation prompt:

    • SRE persona: judge evaluates as a senior Site Reliability Engineer on OpenShift/Kubernetes
    • Per-dimension scoring: separate scores for Diagnosis, Action, and Verification (+ average), with N/A for absent dimensions
    • Reasoning-before-score: forces the LLM to reason before committing to a number, reducing post-hoc rationalization
    • 3 calibration examples: anchors scoring on concrete K8s scenarios (high score, low score, infrastructure failure with correct diagnosis)
    • Dedicated parser (_parse_proposal_eval_response): extracts sub-scores and average from the multi-dimensional output format
  3. Per-scenario test infrastructure — setup/cleanup scripts refactored from monolithic per-provider scripts to:

    • Shared infra scripts (_setup_infra-openai.sh, _setup_infra-claude-vertex.sh) sourced by scenario scripts
    • Per-scenario setup/cleanup (setup_oomkill-openai.sh, setup_crashloop_probe-openai.sh, etc.)
    • New crashloop-probe-demo fixture (nginx with misconfigured liveness probe at /nonexistent-health)
  4. Shellcheck complianceexport for variables consumed by sourced scripts (SC2034), exclude SC1091 for dynamic source paths in Makefile

Why

The existing ProposalDriver only extracts condition message fields via _extract_summary, losing the rich structured data from child Result CRs. In particular, the Diagnosis from AnalysisResult (root cause analysis, confidence level, detailed summary) is never captured, making it impossible to evaluate whether the agentic workflow diagnosed and remediated the issue correctly.

custom:proposal_status provides deterministic pass/fail on workflow phase, but cannot assess the quality of diagnosis, actions, or verification. The new proposal_evaluation_correctness metric fills this gap using an LLM judge with a structured, multi-dimensional prompt.

Design choices

CLIClient abstraction

CLI operations (run, get_resource, apply, delete) are extracted into a CLIClient ABC with a KubeCLI implementation. Both ProposalDriver and ProposalAmender use the same interface; tests inject a mock CLIClient without patching subprocess internals.

ProposalAmender as a separate class

Follows the APIDataAmender pattern — a dedicated class composed into the driver, responsible for enriching TurnData in-place. Navigates proposal_status.steps.<step>.results[] to read StepResultRef entries, then fetches each child CR via CLIClient.get_resource().

Multi-dimensional judge prompt

The prompt produces per-dimension scores instead of a single holistic score:

  • Diagnosis, Action, Verification scored independently (0.0–1.0 or N/A)
  • Average computed from present dimensions
  • Calibration Example C explicitly anchors the "correct diagnosis but infrastructure failure" edge case — Action is marked N/A when execution fails for infra reasons (timeout, sandbox crash), not penalized as agent reasoning failure

Per-scenario scripts

Each scenario (oomkill, crashloop-probe) × provider (openai, claude-vertex) has its own setup/cleanup script that sources a shared _setup_infra-{provider}.sh / _cleanup_infra-{provider}.sh. This avoids deploying all fixtures for every conversation and makes adding new scenarios mechanical.

Relationship between the two proposal metrics

Metric Type Evaluates Requires
custom:proposal_status Deterministic Phase matches expected (Completed/Failed/Denied/Escalated) expected_proposal_status
custom:proposal_evaluation_correctness LLM judge Quality of diagnosis, actions, verification response (from amender) + expected_response

They are complementary: proposal_status checks what happened, proposal_evaluation_correctness checks how well it was done.

Test plan

  • Existing test_proposal_driver.py tests pass unchanged (regression)
  • ProposalAmender unit tests: analysis-only, analysis+execution, full pipeline, failed step, empty results, Markdown summary formatting
  • proposal_evaluation_correctness metric unit tests: mock LLM, multi-dimensional score parsing, missing response handling, conversation-level skip, SRE persona verification
  • _parse_proposal_eval_response parser unit tests: all dimensions, N/A dimensions, fallback average computation, unparseable input
  • Integration tests updated: test_oomkill_full_lifecycle, test_analysis_only, test_oomkill_claude_vertex
  • New crashloop-probe-demo fixture and integration eval data for both providers
  • Shellcheck passes with per-scenario scripts
  • make pre-commit && make test green

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added new custom metric: custom:proposal_evaluation_correctness (LLM-judge) with 0.75 threshold (disabled by default).
  • Documentation

    • README and Evaluation Guide updated with metric definition, scoring rules, required fields, example config, and quick-reference entry.
  • Enhancements

    • Richer, structured proposal evaluation summaries.
    • New pipeline component to build proposal summaries and a CLI-based cluster client for resource ops.
    • Updated integration scenarios and setup/cleanup scripts for OOMKill and CrashLoopBackOff.
  • Tests

    • Expanded unit/integration tests covering metric parsing, proposal amender, CLI timeouts, and outcome routing.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds derive_phase(), a CLIClient/KubeCLI, and a ProposalAmender that aggregates Result CRs and renders workflow summaries; refactors ProposalDriver to use these components; registers a new LLM-judged metric proposal_evaluation_correctness with prompt and JSON parsing; updates config, docs, tests, and integration scripts.

Changes

Proposal Evaluation and Enrichment Pipeline

Layer / File(s) Summary
Phase derivation extraction and public API
src/lightspeed_evaluation/core/proposal/phase.py, src/lightspeed_evaluation/core/proposal/__init__.py, src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py, tests/unit/core/metrics/custom/test_proposal_eval.py
Extracts phase derivation into derive_phase() and replaces local _derive_phase usages across metrics and tests.
TurnData model extension for proposal results
src/lightspeed_evaluation/core/models/data.py, src/lightspeed_evaluation/core/system/validator.py, config/system.yaml, docs/EVALUATION_GUIDE.md, README.md
Adds optional expected_outcome, per-phase expected outcome fields, and proposal_results; registers custom:proposal_evaluation_correctness in config and validator and documents the metric.
Prompts and metric contracts
src/lightspeed_evaluation/core/metrics/custom/prompts.py, docs/EVALUATION_GUIDE.md, README.md
Add PROPOSAL_EVALUATION_CORRECTNESS_PROMPT (SRE-style rubric), remove duplicate intent example line, and document metric contract and threshold.
LLM-as-judge metric: proposal_evaluation_correctness
src/lightspeed_evaluation/core/metrics/custom/custom.py, tests/unit/core/metrics/custom/test_custom.py
Adds evaluator _evaluate_proposal_evaluation_correctness, JSON parsing helpers, prompt wiring, metric registration in supported_metrics, and unit tests covering parsing, N/A handling, prompts, and LLM error cases.
Kubernetes CLI abstraction and implementation
src/lightspeed_evaluation/pipeline/evaluation/cli.py, tests/unit/pipeline/evaluation/test_proposal_amender.py
Adds CLIClient ABC and KubeCLI implementing run/get_resource/apply/delete with timeout handling and JSON parsing; covered by timeout-related unit tests.
ProposalAmender: workflow enrichment from child CRs
src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py, tests/unit/pipeline/evaluation/test_proposal_amender.py
Implements ProposalAmender to fetch per-step Result CRs via CLIClient.get_resource, populate turn_data.proposal_results, and render a Markdown workflow summary (Request, Analysis, Execution, Verification, Escalation, Outcome). Tests cover analysis-only, full pipeline, edge cases, and error handling.
ProposalDriver refactoring: CLI abstraction and amender integration
src/lightspeed_evaluation/pipeline/evaluation/driver.py, tests/unit/pipeline/evaluation/test_driver.py
Refactors ProposalDriver to construct KubeCLI and ProposalAmender, delegate cluster operations to KubeCLI, call ProposalAmender.amend() for enrichment, use derive_phase() for terminal detection, and centralize terminal outcome mapping. Adjusts driver tests and config validation for cli_timeout.
Integration infra, fixtures, and scripts
tests/integration/agentic/fixtures/*, tests/integration/agentic/scripts/*, tests/integration/system-config-agents-proposal.yaml, Makefile
Adds crashloop deployment fixture, provider-specific setup/cleanup scripts, increases proposal agent timeout, restructures setup flow to separate infra from workload scripts, and tweaks Makefile shellcheck flags.
Integration evaluation data and tests
tests/integration/test_evaluation_data_proposal.yaml, tests/integration/test_proposal_evaluation.py
Restructures evaluation datasets into analysis-only and provider-specific full-lifecycle groups, adds crashloop scenarios, includes custom:proposal_evaluation_correctness thresholds, and updates integration tests/timeouts.

Sequence Diagram(s)

sequenceDiagram
  participant ProposalDriver
  participant ProposalAmender
  participant KubeCLI
  participant ResultCR
  participant Phase as derive_phase
  ProposalDriver->>KubeCLI: apply(...) / get status / delete(...)
  ProposalDriver->>ProposalAmender: amend(turn_data, proposal_status)
  ProposalAmender->>KubeCLI: get_resource(resource_plural, name)
  KubeCLI->>ResultCR: oc get ... -o json
  ResultCR-->>KubeCLI: status JSON
  KubeCLI-->>ProposalAmender: parsed status
  ProposalAmender->>ProposalAmender: aggregate into proposal_results & build_summary
  ProposalAmender-->>ProposalDriver: enriched turn_data.response
  ProposalDriver->>Phase: derive_phase(conditions, proposal_spec)
  Phase-->>ProposalDriver: terminal phase
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • asamal4
  • VladimirKadlec
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: JudgeLLM evaluation with ProposalAmender' clearly summarizes the main changes: adding LLM-as-judge evaluation capabilities via a new ProposalAmender component. The changes span metric evaluation, data enrichment, and infrastructure, all of which are accurately reflected in this concise title.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Fixed
Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Fixed
@rioloc rioloc force-pushed the feat/judge-evaluation branch 2 times, most recently from 354b221 to 5c2f4b1 Compare May 28, 2026 10:58

import json
import os
import subprocess
@rioloc rioloc force-pushed the feat/judge-evaluation branch from ed62f0f to 5492575 Compare May 28, 2026 14:36
@rioloc rioloc marked this pull request as ready for review May 29, 2026 10:18
@rioloc rioloc force-pushed the feat/judge-evaluation branch from 5492575 to c47a46f Compare May 29, 2026 10:20
@rioloc
Copy link
Copy Markdown
Collaborator Author

rioloc commented May 29, 2026

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/integration/test_evaluation_data_proposal.yaml (1)

95-104: 💤 Low value

expected_response is likely unused for this metric.

custom:proposal_evaluation_correctness is a turn-level LLM-as-judge metric that requires only response (no ground truth), so expected_response here will not be consulted during scoring. It's harmless but can mislead readers into thinking the judge compares against it. Consider dropping it or adding a brief comment clarifying it's documentation-only.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_evaluation_data_proposal.yaml` around lines 95 - 104,
The test includes an unnecessary expected_response alongside the turn-level
judge metric custom:proposal_evaluation_correctness (defined in turn_metrics and
turn_metrics_metadata) which doesn't use ground truth; remove the
expected_response block from the test or, if you want to keep it for
human-readable documentation, add a short inline comment next to
expected_response stating it is documentation-only and not used by the
custom:proposal_evaluation_correctness metric so readers aren’t misled.
tests/integration/test_proposal_evaluation.py (1)

204-236: ⚡ Quick win

Test asserts less than its docstring claims.

The docstring states this verifies that custom:proposal_evaluation_correctness runs against the response and the pipeline completes, but the body only checks that turn.response is populated — identical to test_full_lifecycle's response check. Nothing confirms the judge metric actually produced a result.

Since live LLM scores are nondeterministic, asserting a specific score is fragile, but you can confirm the metric path executed by checking the emitted results (e.g., the JSON output written to tmp_path / "eval_output") contain a custom:proposal_evaluation_correctness entry for the turn. This makes the test meaningfully distinct from the lifecycle test.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_proposal_evaluation.py` around lines 204 - 236, The
test_judge_evaluation currently only asserts that ProposalDriver populated
turn.response; update it to also verify that the judge metric ran by reading the
evaluation output written to the configured storage (the FileBackend output_dir
set to tmp_path / "eval_output") after calling evaluate(system_config,
eval_data) and assert that a result entry for
custom:proposal_evaluation_correctness exists and is associated with the
evaluated turn; locate this logic near the test_judge_evaluation function and
use the same identifiers (system_config, evaluate, eval_data, tmp_path, and the
turn from eval_data[0].turns[0]) to load the JSON results and assert the
presence of the custom:proposal_evaluation_correctness metric for that turn.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/cli.py`:
- Around line 61-75: The KubeCLI.run method can raise subprocess.TimeoutExpired
which escapes callers like KubeCLI.get_resource and ProposalAmender.amend;
modify KubeCLI.run to catch subprocess.TimeoutExpired and normalize it by
returning a failing subprocess.CompletedProcess (non-zero returncode, empty
stdout, stderr describing the timeout and including the timeout value/command)
so callers always receive a CompletedProcess rather than an exception; update
any references in get_resource/ProposalAmender.amend to rely on CompletedProcess
return semantics (or alternatively, raise the project-specific EvaluationError
consistently if your codebase prefers exceptions).

In `@src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py`:
- Around line 36-39: The try/except in ProposalAmender that calls self._do_amend
currently only catches KeyError/TypeError/ValueError and therefore misses
subprocess errors from self._cli.get_resource (via KubeCLI.run); update the
except clause in ProposalAmender.execute (the block wrapping self._do_amend) to
also catch subprocess.SubprocessError and subprocess.TimeoutExpired (or broaden
to Exception if preferred), or alternatively normalize CLI exceptions inside
KubeCLI.run/_cli.get_resource so they raise a common custom exception that
ProposalAmender can catch; reference _do_amend, ProposalAmender,
_cli.get_resource, KubeCLI.run and ProposalDriver.execute_turn when making the
change.
- Line 80: Remove the stray stdout dump by replacing the print call that outputs
turn_data.response with structured logging: call logger.debug(...) (using the
module logger or create one via logging.getLogger(__name__) if absent) so the
Markdown summary is logged at debug level instead of printed; update the
location where print(turn_data.response) appears in proposal_amender.py (the
code handling turn_data response/amend flow) to use logger.debug and ensure
imports/logger declaration are present.

---

Nitpick comments:
In `@tests/integration/test_evaluation_data_proposal.yaml`:
- Around line 95-104: The test includes an unnecessary expected_response
alongside the turn-level judge metric custom:proposal_evaluation_correctness
(defined in turn_metrics and turn_metrics_metadata) which doesn't use ground
truth; remove the expected_response block from the test or, if you want to keep
it for human-readable documentation, add a short inline comment next to
expected_response stating it is documentation-only and not used by the
custom:proposal_evaluation_correctness metric so readers aren’t misled.

In `@tests/integration/test_proposal_evaluation.py`:
- Around line 204-236: The test_judge_evaluation currently only asserts that
ProposalDriver populated turn.response; update it to also verify that the judge
metric ran by reading the evaluation output written to the configured storage
(the FileBackend output_dir set to tmp_path / "eval_output") after calling
evaluate(system_config, eval_data) and assert that a result entry for
custom:proposal_evaluation_correctness exists and is associated with the
evaluated turn; locate this logic near the test_judge_evaluation function and
use the same identifiers (system_config, evaluate, eval_data, tmp_path, and the
turn from eval_data[0].turns[0]) to load the JSON results and assert the
presence of the custom:proposal_evaluation_correctness metric for that turn.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e2a40ded-3b7b-4b8b-9ed7-9c9473a5bf72

📥 Commits

Reviewing files that changed from the base of the PR and between 9ada2a8 and c47a46f.

📒 Files selected for processing (20)
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/proposal/__init__.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/cli.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • tests/integration/system-config-agents-proposal.yaml
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/integration/test_proposal_evaluation.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py
  • tests/unit/pipeline/evaluation/test_proposal_amender.py
  • tests/unit/pipeline/evaluation/test_proposal_driver.py

Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Outdated
Comment thread src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
Comment thread src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py Outdated
@rioloc rioloc force-pushed the feat/judge-evaluation branch 2 times, most recently from 37b4196 to 4a5dc4c Compare May 29, 2026 17:03
Comment thread src/lightspeed_evaluation/pipeline/evaluation/cli.py Dismissed
Comment thread src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py Dismissed
@rioloc rioloc force-pushed the feat/judge-evaluation branch from 4a5dc4c to 20926e7 Compare May 29, 2026 17:09
Copy link
Copy Markdown
Collaborator

@asamal4 asamal4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you !!

Comment thread src/lightspeed_evaluation/core/system/validator.py Outdated
Comment thread src/lightspeed_evaluation/pipeline/evaluation/driver.py
Comment thread src/lightspeed_evaluation/core/metrics/custom/prompts.py Outdated
@rioloc rioloc force-pushed the feat/judge-evaluation branch from 20926e7 to 0db3bf8 Compare June 1, 2026 13:04
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
tests/integration/agentic/scripts/_setup_infra-openai.sh (1)

34-44: 💤 Low value

Optional: align secret creation with the safer oc create secret --dry-run pattern.

Interpolating ${OPENAI_API_KEY} directly into the heredoc YAML can break if the value ever contains YAML-special characters. The Claude/Vertex infra script already uses oc create secret … --dry-run=client -o yaml | oc apply -f -, which sidesteps escaping. Consider using the same approach here for consistency and robustness.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/agentic/scripts/_setup_infra-openai.sh` around lines 34 -
44, Replace the heredoc that directly interpolates ${OPENAI_API_KEY} with the
safer oc create secret --dry-run=client -o yaml pattern: instead of the cat
<<EOF | oc apply ... block that creates the Secret named eval-llm-credentials,
call oc create secret generic eval-llm-credentials with --from-literal entries
for OPENAI_API_KEY, LIGHTSPEED_AGENT_PROVIDER and OPENAI_MODEL (using the
AGENT_MODEL and OPENAI_API_KEY vars), add --dry-run=client -o yaml, and pipe
that output into oc apply -n "$OPERATOR_NS" -f -; update the script to remove
the heredoc and use this create|apply pipeline so values containing YAML-special
chars are correctly handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/metrics/custom/custom.py`:
- Around line 326-355: The parsing logic in _parse_proposal_eval_response uses
diagnosis/action/verification/average directly from the JSON and
formats/arithmetic-uses them, but _try_parse_float is unused so string numbers
break f"{v:.2f}" and sum(present)/len(present); update
_parse_proposal_eval_response to call _try_parse_float on each of
data.get("diagnosis"), data.get("action"), data.get("verification"), and
data.get("average") (store back into sub_scores and average), so present becomes
a list of floats and formatting with f"{v:.2f}" and average calculations are
safe; ensure _try_parse_float accepts generic input (str/None/number) and
returns Optional[float] as already defined.

In `@src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py`:
- Around line 126-129: The loop that builds option headings currently hardcodes
"(Approved)" for idx == 0; change it to derive approval from explicit signals:
inspect each option dict for common approval keys (e.g., "approved",
"is_approved", "selected") and/or consult the surrounding proposal
status/analysis_results (e.g., a "selected_index" or "approved_option" field)
and set label = "(Approved)" only when such an explicit indicator is true/
matches this idx; if no explicit indicator exists, omit the label. Update the
logic in the for idx, option in enumerate(options) block (and any code that
constructs options/analysis_results) to prefer these explicit fields rather than
relying on idx == 0 as a fallback.

---

Nitpick comments:
In `@tests/integration/agentic/scripts/_setup_infra-openai.sh`:
- Around line 34-44: Replace the heredoc that directly interpolates
${OPENAI_API_KEY} with the safer oc create secret --dry-run=client -o yaml
pattern: instead of the cat <<EOF | oc apply ... block that creates the Secret
named eval-llm-credentials, call oc create secret generic eval-llm-credentials
with --from-literal entries for OPENAI_API_KEY, LIGHTSPEED_AGENT_PROVIDER and
OPENAI_MODEL (using the AGENT_MODEL and OPENAI_API_KEY vars), add
--dry-run=client -o yaml, and pipe that output into oc apply -n "$OPERATOR_NS"
-f -; update the script to remove the heredoc and use this create|apply pipeline
so values containing YAML-special chars are correctly handled.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7824cbcf-518b-42ff-8fdf-9d6e338f0123

📥 Commits

Reviewing files that changed from the base of the PR and between c47a46f and 0db3bf8.

📒 Files selected for processing (37)
  • Makefile
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/proposal/__init__.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/cli.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_cleanup_infra-openai.sh
  • tests/integration/agentic/scripts/_setup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_setup_infra-openai.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_oomkill-openai.sh
  • tests/integration/system-config-agents-proposal.yaml
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/integration/test_proposal_evaluation.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py
  • tests/unit/pipeline/evaluation/test_driver.py
  • tests/unit/pipeline/evaluation/test_proposal_amender.py
  • tests/unit/pipeline/evaluation/test_proposal_driver.py
💤 Files with no reviewable changes (2)
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
✅ Files skipped from review due to trivial changes (6)
  • tests/integration/system-config-agents-proposal.yaml
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • src/lightspeed_evaluation/core/proposal/init.py
  • docs/EVALUATION_GUIDE.md
  • README.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • src/lightspeed_evaluation/core/system/validator.py
  • tests/unit/core/metrics/custom/test_custom.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py

Comment thread src/lightspeed_evaluation/core/metrics/custom/custom.py
Comment thread src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
@rioloc rioloc force-pushed the feat/judge-evaluation branch from 0db3bf8 to 79d3c09 Compare June 1, 2026 14:20
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/lightspeed_evaluation/pipeline/evaluation/driver.py (1)

143-148: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor enabled=False before creating cluster dependencies.

validate_config() explicitly skips the CLI presence check when the driver is disabled, but this constructor still eagerly builds KubeCLI and ProposalAmender. That leaves ProposalDriver inconsistent with HttpApiDriver: a disabled proposal driver can still reach runtime code paths that try to use an empty CLI path instead of behaving like a no-op.

Proposed fix
 class ProposalDriver(AgentDriver):
     """Driver that manages Proposal CR lifecycle via oc/kubectl CLI."""
 
     def __init__(self, config: dict[str, Any], *, enabled: bool = True) -> None:
         """Initialize the proposal driver."""
         super().__init__(config, enabled=enabled)
         self._cli = self._resolve_cli()
-        self._kube_cli = KubeCLI(
-            cli_path=self._cli,
-            namespace=self._config.namespace,
-            timeout=self._config.cli_timeout,
-        )
-        self._amender = ProposalAmender(self._kube_cli)
+        self._kube_cli = None
+        self._amender = None
+        if enabled:
+            self._kube_cli = KubeCLI(
+                cli_path=self._cli,
+                namespace=self._config.namespace,
+                timeout=self._config.cli_timeout,
+            )
+            self._amender = ProposalAmender(self._kube_cli)
@@
     def execute_turn(
         self, turn_data: TurnData, conversation_id: Optional[str] = None
     ) -> tuple[Optional[str], Optional[str]]:
         """Execute a Proposal CR lifecycle for a single turn."""
+        if not self._enabled or self._kube_cli is None or self._amender is None:
+            return None, conversation_id
+
         # Proposal CR lifecycle:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/driver.py` around lines 143 -
148, The constructor currently always instantiates KubeCLI and ProposalAmender
even when the driver is disabled; change the ProposalDriver constructor to only
create self._kube_cli = KubeCLI(...) and self._amender = ProposalAmender(...)
when the driver is enabled (e.g., if self._config.enabled: ...), otherwise set
those attributes to None (or leave uninitialized) so no cluster dependencies are
created for a disabled driver, and update any methods that reference
self._kube_cli or self._amender to guard for None or perform lazy
initialization.
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_proposal_driver.py (1)

624-625: ⚡ Quick win

Assert denied turns still produce a response.

Denied is now a non-error path, so downstream evaluation will still consume the amended summary. This test only checks proposal_status, which would miss regressions where denied turns stop populating turn.response.

Proposed test addition
         error, _ = driver.execute_turn(turn)
 
         assert error is None
+        assert turn.response
         assert turn.proposal_status == status
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/pipeline/evaluation/test_proposal_driver.py` around lines 624 -
625, The test currently only asserts error is None and turn.proposal_status ==
status, but when status is Denied we must also ensure the turn still produces a
response; update the test in test_proposal_driver.py to assert that
turn.response is populated for denied turns (e.g., assert turn.response is not
None and/or assert turn.response.text or length > 0) so regressions that drop
the amended summary are caught; reference the existing locals error, turn, and
proposal_status/status when adding the extra assertion(s).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/driver.py`:
- Around line 247-250: The current warning treats any non-COMPLETED
TerminalOutcome (variable outcome) as a warning; change the condition around
logger.warning in driver.py so that TerminalOutcome.DENIED is considered a
normal terminal state and does not emit a warning. Concretely, update the check
that uses TerminalOutcome.COMPLETED (referenced alongside cr_name and
logger.warning) to only warn when outcome is neither TerminalOutcome.COMPLETED
nor TerminalOutcome.DENIED (or use an explicit if outcome in
[TerminalOutcome.COMPLETED, TerminalOutcome.DENIED]: skip warning; else: log
warning).

In `@tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh`:
- Around line 14-16: The problem is that the unconditional "|| true" after the
oc wait call (oc wait --for=condition=Available=false
deployment/crashloop-probe-demo -n "$TEST_NS" --timeout=60s 2>/dev/null || true)
masks failures and allows the script to print "Setup complete." even when oc
wait fails; remove the "|| true", check the exit status of oc wait, and on
failure print a clear error including "$TEST_NS" and the oc output (avoid
redirecting stderr to /dev/null) and exit with a non-zero status so downstream
tests fail fast; only proceed to the sleep 10 and echo "Setup complete." when oc
wait succeeds.

---

Duplicate comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/driver.py`:
- Around line 143-148: The constructor currently always instantiates KubeCLI and
ProposalAmender even when the driver is disabled; change the ProposalDriver
constructor to only create self._kube_cli = KubeCLI(...) and self._amender =
ProposalAmender(...) when the driver is enabled (e.g., if self._config.enabled:
...), otherwise set those attributes to None (or leave uninitialized) so no
cluster dependencies are created for a disabled driver, and update any methods
that reference self._kube_cli or self._amender to guard for None or perform lazy
initialization.

---

Nitpick comments:
In `@tests/unit/pipeline/evaluation/test_proposal_driver.py`:
- Around line 624-625: The test currently only asserts error is None and
turn.proposal_status == status, but when status is Denied we must also ensure
the turn still produces a response; update the test in test_proposal_driver.py
to assert that turn.response is populated for denied turns (e.g., assert
turn.response is not None and/or assert turn.response.text or length > 0) so
regressions that drop the amended summary are caught; reference the existing
locals error, turn, and proposal_status/status when adding the extra
assertion(s).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 384b0716-bdea-4ee2-9c2c-4f4efc9b7bcc

📥 Commits

Reviewing files that changed from the base of the PR and between 0db3bf8 and 79d3c09.

📒 Files selected for processing (37)
  • Makefile
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/proposal/__init__.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/cli.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_cleanup_infra-openai.sh
  • tests/integration/agentic/scripts/_setup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_setup_infra-openai.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_oomkill-openai.sh
  • tests/integration/system-config-agents-proposal.yaml
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/integration/test_proposal_evaluation.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py
  • tests/unit/pipeline/evaluation/test_driver.py
  • tests/unit/pipeline/evaluation/test_proposal_amender.py
  • tests/unit/pipeline/evaluation/test_proposal_driver.py
💤 Files with no reviewable changes (2)
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
✅ Files skipped from review due to trivial changes (5)
  • README.md
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/system/validator.py
🚧 Files skipped from review as they are similar to previous changes (26)
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • tests/integration/system-config-agents-proposal.yaml
  • tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
  • tests/integration/agentic/scripts/setup_oomkill-openai.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/_cleanup_infra-openai.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/pipeline/evaluation/test_driver.py
  • tests/integration/agentic/scripts/_setup_infra-openai.sh
  • tests/integration/test_proposal_evaluation.py
  • tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
  • src/lightspeed_evaluation/core/proposal/init.py
  • src/lightspeed_evaluation/pipeline/evaluation/cli.py
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/integration/agentic/scripts/_setup_infra-claude-vertex.sh
  • src/lightspeed_evaluation/core/proposal/phase.py
  • tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh
  • tests/unit/core/metrics/custom/test_custom.py
  • config/system.yaml
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • tests/unit/pipeline/evaluation/test_proposal_amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py

Comment thread src/lightspeed_evaluation/pipeline/evaluation/driver.py
Comment thread tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh
@rioloc rioloc force-pushed the feat/judge-evaluation branch from 784ffa8 to 8e3d6bb Compare June 3, 2026 09:21
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lightspeed_evaluation/core/metrics/custom/prompts.py`:
- Around line 68-72: The scoring prompt is ambiguous about infra-blocked
executions (lines referencing Diagnosis/Action/Verification and Example C);
update the prompt text to state a single consistent rule: when execution cannot
run due to infrastructure failures, mark Action as N/A (not scored) rather than
emitting action = null, and clarify that only present/non-N/A dimensions are
averaged for the final score; explicitly reference the three dimensions
(Diagnosis, Action, Verification) and Example C so readers know this applies to
infra-blocked executions and to the scoring/averaging logic.
- Around line 48-110: The file defines a very long module-level template string
PROPOSAL_EVALUATION_CORRECTNESS_PROMPT that violates line-length lint rules;
instead of disabling pylint, refactor the constant into concatenated/dedented
string fragments (e.g., use implicit adjacent string concatenation or a
tuple/join of shorter strings) and reinsert the
{request}/{workflow_summary}/{expected_outcome} placeholders unchanged; update
the symbol PROPOSAL_EVALUATION_CORRECTNESS_PROMPT to be built from these shorter
pieces so no single source line exceeds the max length and remove any file-level
pylint disable comments.

In `@tests/integration/agentic/scripts/setup_oomkill-openai.sh`:
- Around line 12-15: The script currently uses "oc wait
--for=condition=Available=false deployment/oomkill-demo ... || true" which masks
failures and can return before the pod is actually OOMKilled/CrashLoopBackOff;
update the setup_oomkill-openai.sh logic to actively wait for the pod(s) of the
oomkill-demo deployment to reach the OOMKilled or CrashLoopBackOff state (e.g.,
poll "oc get pod -l app=oomkill-demo -n \"$TEST_NS\" -o jsonpath=..." and
inspect containerStatuses[].state.terminated.reason == \"OOMKilled\" or
containerStatuses[].state.waiting.reason == \"CrashLoopBackOff\"), loop with a
timeout, and remove the "|| true" so the script returns non-zero on timeout to
signal failure to processor.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e7f8c582-a9a1-40c7-b9ca-f80f3bcc0f35

📥 Commits

Reviewing files that changed from the base of the PR and between 79d3c09 and 8e3d6bb.

📒 Files selected for processing (37)
  • Makefile
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/proposal/__init__.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/cli.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_cleanup_infra-openai.sh
  • tests/integration/agentic/scripts/_setup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_setup_infra-openai.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_oomkill-openai.sh
  • tests/integration/system-config-agents-proposal.yaml
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/integration/test_proposal_evaluation.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py
  • tests/unit/pipeline/evaluation/test_driver.py
  • tests/unit/pipeline/evaluation/test_proposal_amender.py
  • tests/unit/pipeline/evaluation/test_proposal_driver.py
💤 Files with no reviewable changes (2)
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/system-config-agents-proposal.yaml
🚧 Files skipped from review as they are similar to previous changes (29)
  • tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
  • README.md
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
  • src/lightspeed_evaluation/core/proposal/init.py
  • Makefile
  • tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
  • tests/unit/pipeline/evaluation/test_driver.py
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/system/validator.py
  • tests/integration/agentic/scripts/_cleanup_infra-openai.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
  • tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
  • src/lightspeed_evaluation/core/models/data.py
  • tests/unit/pipeline/evaluation/test_proposal_driver.py
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/integration/test_proposal_evaluation.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py
  • tests/integration/agentic/scripts/_setup_infra-claude-vertex.sh
  • tests/unit/pipeline/evaluation/test_proposal_amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/cli.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py

Comment thread src/lightspeed_evaluation/core/metrics/custom/prompts.py
Comment thread src/lightspeed_evaluation/core/metrics/custom/prompts.py Outdated
Comment thread tests/integration/agentic/scripts/setup_oomkill-openai.sh Outdated
@rioloc rioloc force-pushed the feat/judge-evaluation branch 3 times, most recently from 2b104d8 to 1022a96 Compare June 3, 2026 10:45
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh (1)

13-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll for the actual CrashLoopBackOff state instead of swallowing the wait failure.

oc wait --for=condition=Available=false ... || true followed by a fixed sleep 10 masks timeout failures and can return success before crashloop-probe-demo has actually entered CrashLoopBackOff. Downstream processing treats exit 0 as "scenario ready," so the proposal scenario can run against a not-yet-failing workload. This is the same weakness that was addressed in setup_oomkill-openai.sh by actively polling pod container state; please apply the equivalent here.

🔧 Suggested fix
-echo "Waiting for crashloop-probe-demo pod to appear..."
-oc wait --for=condition=Available=false deployment/crashloop-probe-demo \
-  -n "$TEST_NS" --timeout=60s 2>/dev/null || true
-sleep 10
-echo "Setup complete."
+echo "Waiting for crashloop-probe-demo pod to reach CrashLoopBackOff..."
+TIMEOUT=120
+INTERVAL=5
+ELAPSED=0
+while [ "$ELAPSED" -lt "$TIMEOUT" ]; do
+  WAITING_REASON=$(oc get pod -l app=crashloop-probe-demo -n "$TEST_NS" \
+    -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}' 2>/dev/null || true)
+  if [ "$WAITING_REASON" = "CrashLoopBackOff" ]; then
+    echo "Pod reached expected state (waiting=$WAITING_REASON)."
+    echo "Setup complete."
+    exit 0
+  fi
+  sleep "$INTERVAL"
+  ELAPSED=$((ELAPSED + INTERVAL))
+done
+echo "ERROR: Timed out after ${TIMEOUT}s waiting for CrashLoopBackOff." >&2
+exit 1
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh`
around lines 13 - 17, Replace the current blind wait in
setup_crashloop_probe-claude-vertex.sh that uses "oc wait
--for=condition=Available=false deployment/crashloop-probe-demo ... || true" and
a fixed sleep with an active poll that checks the pod's container state until it
reaches CrashLoopBackOff (or times out); specifically, loop (with a configurable
timeout and short sleep interval) using oc get pods (or oc get pod -l
app=crashloop-probe-demo) and inspect the
pod.status.containerStatuses[*].state.waiting.reason for "CrashLoopBackOff" for
the crashloop-probe-demo pod in $TEST_NS, and fail the script if the timeout
expires — mirror the polling approach used in setup_oomkill-openai.sh but target
the crashloop-probe-demo pod/container.
🧹 Nitpick comments (3)
src/lightspeed_evaluation/pipeline/evaluation/cli.py (3)

12-50: 💤 Low value

Consider completing Google-style docstrings with Args sections.

The public API methods have basic docstrings but are missing Args sections. As per coding guidelines, public APIs in src/**/*.py should use Google-style docstrings, which typically include Args, Returns, and Raises sections.

Example for run():

def run(
    self,
    args: list[str],
    stdin: Optional[str] = None,
) -> subprocess.CompletedProcess[str]:
    """Run a CLI command and return the completed process.
    
    Args:
        args: Command arguments to pass to the CLI binary.
        stdin: Optional string to pass to the command's stdin.
    
    Returns:
        A CompletedProcess instance with captured stdout/stderr.
    """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/cli.py` around lines 12 - 50,
The docstrings on the public interface CLIClient methods (__init__, run,
get_resource, apply, delete) are missing Google-style sections; update each
method's docstring to include Args (describe parameters like timeout, args,
stdin, resource_plural, name, manifest), Returns (describe return types such as
subprocess.CompletedProcess[str] or tuple[dict[str, Any], Optional[str]]), and
Raises if applicable (e.g., any exceptions your implementations may propagate),
following the Google docstring format used across the repo so tools and readers
can parse parameter and return semantics consistently.

61-84: ⚡ Quick win

Add structured logging for timeout events.

The timeout handling correctly converts subprocess.TimeoutExpired into a failed CompletedProcess, but timeout events are not logged. As per coding guidelines, code in src/lightspeed_evaluation/**/*.py should use structured logging with appropriate log levels. Logging timeouts at this layer would improve observability for infrastructure issues.

📊 Suggested enhancement
+import logging
+
+logger = logging.getLogger(__name__)
+
+
 class CLIClient(ABC):
         except subprocess.TimeoutExpired:
+            logger.error(
+                "CLI command timed out after %ds: %s",
+                self._timeout,
+                " ".join([self._cli, *args]),
+            )
             cmd_str = " ".join([self._cli, *args])
             return subprocess.CompletedProcess(

As per coding guidelines: "Use structured logging with appropriate log levels in Python code" for src/lightspeed_evaluation/**/*.py.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/cli.py` around lines 61 - 84,
In the run method's except block for subprocess.TimeoutExpired, add structured
logging (use a module-level logger via logging.getLogger(__name__)) to record
the timeout event with fields like command (compose from self._cli and args),
timeout seconds (self._timeout), and exception info; log at warning or error
level before returning the subprocess.CompletedProcess, and ensure the module
imports logging and defines the logger variable so the log call can reference
it.

105-111: ⚡ Quick win

Add structured logging for JSON decode failures.

JSON decode failures indicate malformed output from the CLI, which is a significant issue worth logging at this layer. Currently, only the caller logs when it receives an error. As per coding guidelines, use structured logging in src/lightspeed_evaluation/**/*.py.

📊 Suggested enhancement
         try:
             return json.loads(result.stdout), None
         except json.JSONDecodeError as exc:
+            logger.error(
+                "Failed to parse JSON for %s/%s: %s. stdout: %r",
+                resource_plural,
+                name,
+                exc,
+                result.stdout[:200],
+            )
             return (
                 {},
                 f"Failed to parse JSON for {resource_plural}/{name}: {exc}",
             )

As per coding guidelines: "Use structured logging with appropriate log levels in Python code" for src/lightspeed_evaluation/**/*.py.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lightspeed_evaluation/pipeline/evaluation/cli.py` around lines 105 - 111,
Add structured logging in the JSON parse error branch so malformed CLI output is
logged with context before returning the error: catch json.JSONDecodeError in
the block that currently returns ({}, f"Failed to parse JSON for
{resource_plural}/{name}: {exc}") and call the module logger (the file's
module-level logger, e.g., logger or getLogger(__name__)) to emit an error-level
structured log containing fields resource_plural, name, stdout (or a truncated
version), and the exception/stacktrace, then return the same error tuple; update
imports if needed to ensure the logger is available.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/agentic/scripts/setup_oomkill-openai.sh`:
- Around line 16-30: The oc jsonpath calls inside the polling loop (where
TERMINATED_REASON and WAITING_REASON are set) can return non-zero early and
break the script under set -euo pipefail; change those command substitutions to
guard against failures (e.g., append || true or redirect failures to yield empty
string) so that TERMINATED_REASON and WAITING_REASON get assigned safely and the
loop continues polling until TIMEOUT, leaving the rest of the loop logic (the
echo/exit checks and ELAPSED increment) unchanged.

---

Duplicate comments:
In `@tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh`:
- Around line 13-17: Replace the current blind wait in
setup_crashloop_probe-claude-vertex.sh that uses "oc wait
--for=condition=Available=false deployment/crashloop-probe-demo ... || true" and
a fixed sleep with an active poll that checks the pod's container state until it
reaches CrashLoopBackOff (or times out); specifically, loop (with a configurable
timeout and short sleep interval) using oc get pods (or oc get pod -l
app=crashloop-probe-demo) and inspect the
pod.status.containerStatuses[*].state.waiting.reason for "CrashLoopBackOff" for
the crashloop-probe-demo pod in $TEST_NS, and fail the script if the timeout
expires — mirror the polling approach used in setup_oomkill-openai.sh but target
the crashloop-probe-demo pod/container.

---

Nitpick comments:
In `@src/lightspeed_evaluation/pipeline/evaluation/cli.py`:
- Around line 12-50: The docstrings on the public interface CLIClient methods
(__init__, run, get_resource, apply, delete) are missing Google-style sections;
update each method's docstring to include Args (describe parameters like
timeout, args, stdin, resource_plural, name, manifest), Returns (describe return
types such as subprocess.CompletedProcess[str] or tuple[dict[str, Any],
Optional[str]]), and Raises if applicable (e.g., any exceptions your
implementations may propagate), following the Google docstring format used
across the repo so tools and readers can parse parameter and return semantics
consistently.
- Around line 61-84: In the run method's except block for
subprocess.TimeoutExpired, add structured logging (use a module-level logger via
logging.getLogger(__name__)) to record the timeout event with fields like
command (compose from self._cli and args), timeout seconds (self._timeout), and
exception info; log at warning or error level before returning the
subprocess.CompletedProcess, and ensure the module imports logging and defines
the logger variable so the log call can reference it.
- Around line 105-111: Add structured logging in the JSON parse error branch so
malformed CLI output is logged with context before returning the error: catch
json.JSONDecodeError in the block that currently returns ({}, f"Failed to parse
JSON for {resource_plural}/{name}: {exc}") and call the module logger (the
file's module-level logger, e.g., logger or getLogger(__name__)) to emit an
error-level structured log containing fields resource_plural, name, stdout (or a
truncated version), and the exception/stacktrace, then return the same error
tuple; update imports if needed to ensure the logger is available.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ae9d29f3-b46a-4162-a022-dcc2f813a52a

📥 Commits

Reviewing files that changed from the base of the PR and between 8e3d6bb and 1022a96.

📒 Files selected for processing (37)
  • Makefile
  • README.md
  • config/system.yaml
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/proposal/__init__.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/cli.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_cleanup_infra-openai.sh
  • tests/integration/agentic/scripts/_setup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/_setup_infra-openai.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh
  • tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_oomkill-openai.sh
  • tests/integration/system-config-agents-proposal.yaml
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/integration/test_proposal_evaluation.py
  • tests/unit/core/metrics/custom/test_custom.py
  • tests/unit/core/metrics/custom/test_proposal_eval.py
  • tests/unit/pipeline/evaluation/test_driver.py
  • tests/unit/pipeline/evaluation/test_proposal_amender.py
  • tests/unit/pipeline/evaluation/test_proposal_driver.py
💤 Files with no reviewable changes (2)
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-openai.sh
  • tests/integration/agentic/scripts/cleanup_proposal_fixtures-claude-vertex.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/system-config-agents-proposal.yaml
🚧 Files skipped from review as they are similar to previous changes (29)
  • tests/integration/agentic/scripts/cleanup_oomkill-openai.sh
  • tests/integration/agentic/scripts/cleanup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-openai.sh
  • config/system.yaml
  • tests/integration/agentic/scripts/_cleanup_infra-claude-vertex.sh
  • tests/integration/agentic/scripts/setup_crashloop_probe-openai.sh
  • src/lightspeed_evaluation/core/system/validator.py
  • tests/integration/agentic/fixtures/crashloop-probe-demo.yaml
  • tests/unit/pipeline/evaluation/test_driver.py
  • Makefile
  • tests/integration/agentic/scripts/setup_oomkill-claude-vertex.sh
  • tests/integration/agentic/scripts/cleanup_crashloop_probe-claude-vertex.sh
  • tests/integration/agentic/scripts/_setup_infra-openai.sh
  • docs/EVALUATION_GUIDE.md
  • src/lightspeed_evaluation/core/metrics/custom/prompts.py
  • tests/integration/test_evaluation_data_proposal.yaml
  • tests/unit/pipeline/evaluation/test_proposal_driver.py
  • src/lightspeed_evaluation/core/proposal/init.py
  • src/lightspeed_evaluation/core/proposal/phase.py
  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/metrics/custom/proposal_eval.py
  • README.md
  • tests/unit/core/metrics/custom/test_proposal_eval.py
  • tests/integration/test_proposal_evaluation.py
  • src/lightspeed_evaluation/pipeline/evaluation/proposal_amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/driver.py
  • tests/unit/core/metrics/custom/test_custom.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • tests/unit/pipeline/evaluation/test_proposal_amender.py

Comment thread tests/integration/agentic/scripts/setup_oomkill-openai.sh
Extract CLI operations (run, get_resource, apply, delete) into an
injectable CLIClient interface with KubeCLI implementation backed
by oc/kubectl. ProposalDriver now delegates to KubeCLI instead of
internal subprocess calls, enabling dependency injection for the
upcoming ProposalAmender.

ProposalAmender fetches AnalysisResult, ExecutionResult,
VerificationResult, and EscalationResult CRs via CLIClient and
populates turn_data.proposal_results with structured status data.
It also builds a Markdown workflow summary into turn_data.response.

- Add proposal_results field to TurnData model
- Create ProposalAmender with CLIClient dependency injection
- Integrate ProposalAmender into ProposalDriver (always enabled)
- Fallback to _extract_summary if amender fails

add custom:proposal_evaluation_correctness LLM-as-judge metric

New metric that evaluates agentic remediation workflow quality using
an LLM judge. Scores 0.0-1.0 based on diagnosis quality, action
appropriateness, risk management, and verification thoroughness.

- Add PROPOSAL_EVALUATION_CORRECTNESS_PROMPT template
- Register metric in CustomMetrics.supported_metrics
- Add METRIC_REQUIREMENTS entry (requires response field)
- Add metrics_metadata threshold (0.75) in system.yaml

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rioloc rioloc force-pushed the feat/judge-evaluation branch from 1022a96 to 7586d1a Compare June 3, 2026 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants